# Image-text retrieval

**Fg Clip Large** · qihoo360 · Apache-2.0 · Multimodal Alignment · Transformers · English · Downloads: 538 · Likes: 3
FG-CLIP is a fine-grained vision-text alignment model that achieves both global and region-level image-text alignment through two-stage training, improving fine-grained visual understanding.

**Siglip2 So400m Patch14 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 622.54k · Likes: 20
SigLIP 2 is a vision-language model based on the SigLIP pre-training objective, integrating several additional techniques to improve semantic understanding, localization, and dense feature extraction.

**Siglip2 So400m Patch14 224** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 23.11k · Likes: 0
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Large Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 4,416 · Likes: 8
SigLIP 2 is an improved model based on SigLIP, integrating multiple techniques to enhance semantic understanding, localization, and dense feature extraction capabilities.

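The SigLIP 2 checkpoints above can be tried directly for zero-shot image classification through the Hugging Face `transformers` pipeline. A minimal sketch, assuming a recent `transformers` release with SigLIP 2 support, the Hub id `google/siglip2-so400m-patch14-384`, and a local image file `photo.jpg` (all assumptions, not part of the listing):

```python
from transformers import pipeline

# Zero-shot image classification with a SigLIP 2 checkpoint (assumed Hub id).
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-so400m-patch14-384",
)

# SigLIP-style models score each candidate label independently (sigmoid),
# so the scores need not sum to 1.
results = classifier(
    "photo.jpg",  # hypothetical local image path
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a car"],
)
print(results)
```
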
**Llm Jp Clip Vit Large Patch14** · llm-jp · Apache-2.0 · Text-to-Image · Japanese · Downloads: 254 · Likes: 1
A Japanese CLIP model trained with the OpenCLIP framework on a dataset of 1.45 billion Japanese image-text pairs, supporting zero-shot image classification and image-text retrieval tasks.

**Llm Jp Clip Vit Base Patch16** · llm-jp · Apache-2.0 · Text-to-Image · Japanese · Downloads: 40 · Likes: 1
A Japanese CLIP model trained with the OpenCLIP framework, supporting zero-shot image classification tasks.

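Because the llm-jp checkpoints are OpenCLIP models, they are typically loaded through the `open_clip` library rather than `transformers`. A minimal sketch, assuming the repository id `llm-jp/llm-jp-clip-vit-base-patch16`, that the checkpoint resolves through `open_clip`'s `hf-hub:` loader, and a local image `photo.jpg` (all assumptions):

```python
import torch
import open_clip
from PIL import Image

# Assumed Hub id; adjust to the actual repository name if it differs.
repo = "hf-hub:llm-jp/llm-jp-clip-vit-base-patch16"

model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # hypothetical local image
texts = tokenizer(["猫の写真", "犬の写真"])  # Japanese prompts: "photo of a cat", "photo of a dog"

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize embeddings, then rank labels by cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot label probabilities for the image
```
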
**Siglip So400m Patch14 224** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 6,654 · Likes: 53
SigLIP is a multimodal model that improves on CLIP by using a sigmoid loss function; it was pre-trained on the WebLI dataset and is suitable for tasks such as zero-shot image classification and image-text retrieval.

**Siglip So400m Patch14 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 6.1M · Likes: 526
SigLIP is a vision-language model pre-trained on the WebLI dataset, employing an improved sigmoid loss function to optimize image-text matching tasks.

**Siglip Large Patch16 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 245.21k · Likes: 6
SigLIP is a multimodal model pre-trained on the WebLI dataset, using an improved sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.

**Siglip Large Patch16 256** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 24.13k · Likes: 12
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function to enhance performance.

**Siglip Base Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 237.79k · Likes: 24
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function, excelling in image classification and image-text retrieval tasks.

**Siglip Base Patch16 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 2,570 · Likes: 10
SigLIP is a multimodal model pre-trained on the WebLI dataset, employing an improved sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.

**Siglip Base Patch16 256** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 12.71k · Likes: 5
SigLIP is a vision-language model pre-trained on the WebLI dataset, employing an improved sigmoid loss function, excelling in image classification and image-text retrieval tasks.

**Clip Flant5 Xl** · zhiqiulin · Apache-2.0 · Text-to-Image · Transformers · English · Downloads: 13.44k · Likes: 2
A vision-language generative model built on google/flan-t5-xl and fine-tuned for image-text retrieval tasks.

**Clip Flant5 Xxl** · zhiqiulin · Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 86.23k · Likes: 2
A vision-language generative model fine-tuned from google/flan-t5-xxl, designed specifically for image-text retrieval tasks.

**Siglip Base Patch16 224** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 250.28k · Likes: 43
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function to optimize image-text matching tasks.

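For image-text retrieval with any of the SigLIP checkpoints above, image and text embeddings can be compared directly; the pairwise logits pass through a sigmoid rather than a softmax, matching the training objective. A minimal sketch with `transformers`, assuming the `google/siglip-base-patch16-224` checkpoint and a hypothetical local image `photo.jpg`:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# SigLIP was trained with fixed-length text, so max_length padding is recommended.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One independent match score per (image, text) pair, mapped to [0, 1] by a sigmoid.
probs = torch.sigmoid(outputs.logits_per_image)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```
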
**CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind** · laion · MIT · Text-to-Image · Downloads: 63 · Likes: 2
A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, focused on zero-shot image classification tasks.